You can prepend the time UNIX command to other commands (for example, `time ./fixfilenames.sh .`). This returns the processing time, in minutes and seconds, which simply helps plan future data analyses.

Make a directory called 2014MISEQ and mv the raw FASTQ files into it (if working directly in UNIX or Mac; otherwise work inside the qiime virtual machine):

mkdir 2014MISEQ

The following script renames the FASTQ files, replacing underscores with periods in the portion of each file name before "L001":

#!/bin/bash
#save this as a script called fixfilenames.sh and save it in your Project folder
cd "$1";
for file in *L001*; do
fp="${file%L001*}";
lp="${file#*L001}";
new="${fp//_/.}L001$lp";
mv "$file" "$new";
done

cd 2014MISEQ
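To see what the parameter expansions in the script do, here is the transformation applied to a single hypothetical MiSeq file name (the name is a made-up example, not one of the real files):

```shell
file="A4900_S9_L001_R1_001.fastq"
fp="${file%L001*}"       # everything before "L001": A4900_S9_
lp="${file#*L001}"       # everything after "L001": _R1_001.fastq
new="${fp//_/.}L001$lp"  # underscores before L001 become periods
echo "$new"              # → A4900.S9.L001_R1_001.fastq
```

Only the sample-identifier portion of the name changes; the L001_R1_001 suffix that QIIME parses is left intact.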
chmod u+x fixfilenames.sh
./fixfilenames.sh . #pass the directory to rename as the argument; "." is the current directory

Next, join the paired-end reads:

multiple_join_paired_ends.py -i /home/qiime/2014MISEQ -o /home/qiime/2014MISEQ/joined_seqs
#took 10 secs with test set

This creates a new directory called joined_seqs.

cd joined_seqs #change directory to the newly created joined_seqs directory
find . -name "*.un1.fastq" -type f -delete #do this exactly as written to avoid deleting other files
find . -name "*.un2.fastq" -type f -delete

Next we demultiplex and quality filter, using the -s (min_qual_score) and -l (min_seq_length) options. Our expected length at this step is 211 bp.

cd .. #change directory back to the main working directory
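With the find … -delete commands above, it is safer to run the same command without -delete first, to inspect exactly which files will match. A sketch with throwaway files in a temporary directory:

```shell
tmp=$(mktemp -d)
touch "$tmp/a.un1.fastq" "$tmp/keep.fastq"
find "$tmp" -name "*.un1.fastq" -type f          # dry run: lists only a.un1.fastq
find "$tmp" -name "*.un1.fastq" -type f -delete  # now delete the matches
ls "$tmp"                                        # → keep.fastq
```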
multiple_split_libraries_fastq.py -i /home/qiime/2014MISEQ/joined_seqs -o /home/qiime/2014MISEQ/split_out --demultiplexing_method sampleid_by_file --include_input_dir_path --remove_filepath_in_name #took 16 secs with test set

This creates a new directory called split_out.

validate_mapping_file.py -m SampleMapFile.txt #good practice to make sure all columns are in correct format
truncate_reverse_primer.py -f ./split_out/seqs.fna -m ./SampleMapFile.txt -z truncate_remove -M 1 -o ./Rtruncated

To strip the forward primer as well, we reverse complement the sequences and run truncate_reverse_primer.py again with a second mapping file, SampleMapFile_reverse.txt, that lists the forward primer in the reverse primer field. We could also flip the sequences with QIIME's adjust_seq_orientation.py script. We used the FASTX command because we will use another FASTX tool later in this section.

fastx_reverse_complement -i ./Rtruncated/seqs_rev_primer_truncated.fna -o ./Rcomp
truncate_reverse_primer.py -f ./Rcomp -m ./SampleMapFile_reverse.txt -z truncate_remove -M 1 -o ./Rcomp_forward_removed
fastx_reverse_complement -i ./Rcomp_forward_removed/Rcomp_rev_primer_truncated.fna -o ./trimmed_seqs.fna

This leaves a single file, trimmed_seqs.fna, with all primers removed. Next we cluster the sequences into OTUs:

pick_otus.py -i ./trimmed_seqs.fna -m swarm --swarm_resolution 2 -o ./swarmOTUs #took 18 sec with test set

This creates a new directory called swarmOTUs. To check progress on a long run, use the top UNIX command in a new terminal window. The pick_otus QIIME command should normally be at the top of the list and using the greatest %CPU. Next we make an OTU table, using trimmed_seqs_otus.txt as input:

make_otu_table.py -i ./swarmOTUs/trimmed_seqs_otus.txt -o ./swarmOTUs/seqs.biom

We filter out low abundance OTU clusters (< 10 copies in at least 1 sample of the entire data set) before writing data to the table, using a script that calls the pandas python package. We used Python 2.7.3, which should come as a dependency with the QIIME install. The provided pandas_filter10.py script has to be in the swarmOTUs directory.
cd ./swarmOTUs #change to the new directory
cp ../pandas_filter10.py .
python pandas_filter10.py #This script will find the seqs.biom file created as output in the last step

The script writes a file called exclude.txt that we can use in the next step to remove those low abundance OTU clusters. Next, we use the pick_rep_set.py script to pick one representative sequence from each cluster. Then we pull those desired representative sequences from our original .fna file and write a new .fna. From here on out, we use the representative sequences from each OTU cluster to reduce processing demand.

cd .. #change back to the main working directory
filter_otus_from_otu_table.py -i ./swarmOTUs/seqs.biom -o ./swarmOTUs/otu_table_no_singletons.biom -e ./swarmOTUs/exclude.txt
biom convert -i ./swarmOTUs/otu_table_no_singletons.biom -o ./swarmOTUs/2014_biom_no_singletons.txt --table-type="OTU table" --to-tsv
pick_rep_set.py -i ./swarmOTUs/trimmed_seqs_otus.txt -f ./trimmed_seqs.fna -m most_abundant -o ./2014_final.fna -l ./2014final_log
filter_fasta.py -f ./2014_final.fna -b ./swarmOTUs/otu_table_no_singletons.biom -o ./2014_final_no_singletons.fna

The fasta_formatter command is part of the FASTX Toolkit. We run the command and then manually add the column headers seqID and seqs to the resulting text file:

fasta_formatter -i ./2014_final_no_singletons.fna -o ./2014tabseqs.txt -t

The 2014tabseqs.txt file will look like this:

| seqID | seqs |
|---|---|
| denovo0 F10R12_2 | AATTTGAGCAG . . . . . |
| denovo1 F10R12_1121 | AATTTGAGCAGG . . . . . |
| denovo10 F1R13_166841 | AAGATGAGCTGG . . . . . |
| . . . . . . . . . | . . . . . |
| denovo789 F10R5_2482 | AAGATGCCTGGAA . . . . . |
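fasta_formatter with -t produces this two-column, tab-separated layout. As an illustration of the same FASTA-to-tabular transformation (the input here uses shortened stand-in sequences, and awk is only a sketch of what the tool does, not the tool itself):

```shell
printf '>denovo0 F10R12_2\nAATTTGAGCAG\n>denovo1 F10R12_1121\nAATTTGAGCAGG\n' |
awk '/^>/  {if (seq) print id "\t" seq; id=substr($0,2); seq=""}
     !/^>/ {seq=seq $0}
     END   {if (seq) print id "\t" seq}'
# → denovo0 F10R12_2	AATTTGAGCAG
# → denovo1 F10R12_1121	AATTTGAGCAGG
```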
For the Ion Torrent data, each IonXpress FASTQ file holds either forward or reverse reads:

| File Name | Direction |
|---|---|
| 2015-03-20.IonXpress_001.fastq | forward |
| 2015-03-20.IonXpress_002.fastq | forward |
| 2015-03-20.IonXpress_003.fastq | forward |
| . . . . . . . . . | . . . |
| 2015-03-20.IonXpress_014.fastq | reverse |
| 2015-03-20.IonXpress_015.fastq | reverse |
We make a directory called 2014ION with two subdirectories, forward and reverse, and put the appropriate IonXpress fastq files into each one. Next, we perform several steps to prepare all the forward FASTQ files, then repeat the process with the reverse FASTQ files, then put all files into 1 FASTA before picking OTUs.

cd 2014ION
mkdir forward
mkdir reverse
mv *001* *002* *003* *004* *005* *006* *007* *008* *009* *010* ./forward
mv *011* *012* *013* *014* *015* ./reverse

cd forward
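The long glob lists in the two mv commands above can also be written with bash brace expansion: *0{01..10}* expands to *001* … *010* before globbing. You can preview the expansion safely in an empty directory, where the unmatched patterns are printed literally (bash-specific; this will not work in plain sh):

```shell
cd "$(mktemp -d)"   # empty directory: the globs match nothing and stay literal
echo *0{01..10}*
# → *001* *002* *003* *004* *005* *006* *007* *008* *009* *010*
```

So `mv *0{01..10}* ./forward` is equivalent to the first mv command.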
#Note that the 10bp forward barcode changes in each command correspond to each file
#The command searches the fastq file and moves the last 10bp to just after the forward barcode for each read
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTACCTTAG\2\1/;' < 2015-03-20.IonXpress_001.fastq > F01_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTCTCCTTA\2\1/;' < 2015-03-20.IonXpress_002.fastq > F02_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GAATCCTCTT\2\1/;' < 2015-03-20.IonXpress_003.fastq > F03_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GATCTTGGTA\2\1/;' < 2015-03-20.IonXpress_004.fastq > F04_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTCCTTCTG\2\1/;' < 2015-03-20.IonXpress_005.fastq > F05_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GAACTTGCAG\2\1/;' < 2015-03-20.IonXpress_006.fastq > F06_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GAATCACGAA\2\1/;' < 2015-03-20.IonXpress_007.fastq > F07_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTATCGGAA\2\1/;' < 2015-03-20.IonXpress_008.fastq > F08_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTCCGCTCA\2\1/;' < 2015-03-20.IonXpress_009.fastq > F09_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTCGGTCAG\2\1/;' < 2015-03-20.IonXpress_010.fastq > F10_barcodes.fastq

cat *_barcodes.fastq > forwardseqs.fastq

convert_fastaqual_fastq.py -f forwardseqs.fastq -c fastq_to_fastaqual #took ~25 seconds with test set

This creates two new files in the forward directory: forwardseqs.fna and forwardseqs.qual. These are the new FASTA file (.fna) and the associated quality file (QUAL). Make sure the SampleMapF.txt file is in the working directory before validating it:

cd .. #change back to 2014ION dir
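To see what the sed commands above do, here is the first one applied to a single toy FASTQ record (the 22 bp read is a made-up example). The address 2~4 is a GNU sed extension that selects every 4th line starting at line 2, i.e. only the sequence lines; the substitution captures the last 10 bp of each read and moves them to the front, just after the inserted barcode GTTACCTTAG:

```shell
printf '@read1\nACGTACGTACGTAAAAAAAAAA\n+\nIIIIIIIIIIIIIIIIIIIIII\n' |
sed '2~4s/\(.*\)\(.\{10\}\)$/GTTACCTTAG\2\1/'
# line 2 becomes: GTTACCTTAGAAAAAAAAAAACGTACGTACGT
```

The header, separator, and quality lines pass through unchanged; only GNU sed supports the 2~4 address form.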
validate_mapping_file.py -m SampleMapF.txt
head SampleMapF.txt

Next we demultiplex and quality filter, using the -s (min_qual_score) and -l (min_seq_length) options. We also remove any sequences with a homopolymer run of 10 or more bp; there is a natural 8 bp conserved region in our target area. The expected length at this point is 234 bp.

split_libraries.py -m SampleMapF.txt -f ./forward/forwardseqs.fna -q ./forward/forwardseqs.qual -o demultiplexedF -b 20 -H 10 -M 1 -s 20 -z truncate_remove #took 1 min, 7 sec with test set

We repeat the process for the reverse files, using SampleMapR.txt.

cd ./reverse
sed '2~4s/\(.*\)\(.\{10\}\)$/GATTCGAGGA\2\1/;' < 2015-03-20.IonXpress_011.fastq > R11_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GAACCACCTA\2\1/;' < 2015-03-20.IonXpress_012.fastq > R12_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GTCCGTTAGA\2\1/;' < 2015-03-20.IonXpress_013.fastq > R13_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GACACTCCAA\2\1/;' < 2015-03-20.IonXpress_014.fastq > R14_barcodes.fastq
sed '2~4s/\(.*\)\(.\{10\}\)$/GACCTCTAGA\2\1/;' < 2015-03-20.IonXpress_015.fastq > R15_barcodes.fastq
cat *_barcodes.fastq > reverseseqs.fastq
convert_fastaqual_fastq.py -f reverseseqs.fastq -c fastq_to_fastaqual
#creates 2 new files: reverseseqs.fna and reverseseqs.qual
cd ..
validate_mapping_file.py -m SampleMapR.txt
split_libraries.py -m SampleMapR.txt -f ./reverse/reverseseqs.fna -q ./reverse/reverseseqs.qual -o demultiplexedR -b 20 -H 10 -M 1 -s 20 -z truncate_remove #if you get a high number of mismatches and sequences are discarded, double check that you have the right mapping file

The reverse reads must be flipped to match the forward reads. We could also do this with QIIME's adjust_seq_orientation.py script. We used the FASTX command because we will use another FASTX tool later in this section.

cd demultiplexedF
mv seqs.fna Fseqs.fna
mv Fseqs.fna ..
cd ..
cd ./demultiplexedR
mv seqs.fna Rseqs.fna
mv Rseqs.fna ..
cd ..
fastx_reverse_complement -i ./Rseqs.fna -o ./Rseqs_flipped.fna
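fastx_reverse_complement reverse complements every sequence in the file. The same operation on a single sequence can be sketched with rev and tr (the sequence is a shortened stand-in):

```shell
echo 'AATTTGAGCAG' | rev | tr 'ACGT' 'TGCA'
# → CTGCTCAAATT
```

rev reverses the string and tr swaps each base for its complement, which together give the reverse complement.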
cat Fseqs.fna Rseqs_flipped.fna > 2014ionseqs.fna

pick_otus.py -i ./2014ionseqs.fna -m swarm --swarm_resolution 2 -o ./swarmOTUs #took 1.8 sec with test set

This creates a new directory called swarmOTUs. To check progress, use the top UNIX command in a new terminal window. The pick_otus QIIME command should be at the top of the list and using the greatest %CPU. Next we make an OTU table, using 2014ionseqs_otus.txt as input:

make_otu_table.py -i ./swarmOTUs/2014ionseqs_otus.txt -o ./swarmOTUs/seqs.biom

As before, the pandas_filter10.py script has to be in the swarmOTUs directory.

cd ./swarmOTUs #change to the new directory
cp ../pandas_filter10.py .
python pandas_filter10.py #This script will find the seqs.biom file created as output in the last step

The script again writes exclude.txt, which we use in the next step to remove the low abundance OTU clusters. As before, we use the pick_rep_set.py script to pick one representative sequence from each cluster, pull those representative sequences from our original .fna file, and write a new .fna.
cd .. #change back to the main working directory
filter_otus_from_otu_table.py -i ./swarmOTUs/seqs.biom -o ./swarmOTUs/otu_table_no_singletons.biom -e ./swarmOTUs/exclude.txt
biom convert -i ./swarmOTUs/otu_table_no_singletons.biom -o ./swarmOTUs/2014ion_biom_no_singletons.txt --table-type="OTU table" --to-tsv
pick_rep_set.py -i ./swarmOTUs/2014ionseqs_otus.txt -f ./2014ionseqs.fna -m most_abundant -o ./2014ion_final.fna -l ./2014ion_final_log
filter_fasta.py -f ./2014ion_final.fna -b ./swarmOTUs/otu_table_no_singletons.biom -o ./2014ion_final_no_singletons.fna

Again we use the fasta_formatter command from the FASTX Toolkit and then manually add the seqID and seqs column headers to the text file:

fasta_formatter -i ./2014ion_final_no_singletons.fna -o ./2014iontabseqs.txt -t

After moving the 2014tabseqs.txt and 2014_biom_no_singletons.txt files (and, for the Ion Torrent data, 2014iontabseqs.txt and 2014_ion_biom_no_singletons.txt) to a working directory outside of the QIIME system, we used several R packages to assign taxonomy and filter out unwanted returns, such as low similarity matches and those outside a reasonable geographic area.

install.packages('rmarkdown')
library('rmarkdown')

setwd("D:\\Google Drive\\IONvsMISEQ")
mydata <- read.table("2014tabseqs.txt", header = TRUE, sep = "\t", dec = ".", stringsAsFactors = FALSE)
head(mydata)

## denovo0.A4900.S9.L001_R1_001_0
## 1 denovo1 A4593.S13.L001_R1_001_116544
## 2 denovo10 A4900.S9.L001_R1_001_3230
## 3 denovo11 A4900.S9.L001_R1_001_483
## 4 denovo12 A4593.S13.L001_R1_001_117167
## 5 denovo13 A4593.S13.L001_R1_001_116755
## 6 denovo14 A4900.S9.L001_R1_001_2085
## AGATATTGGAACTTTATATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGAACATCTTTAAGACTTTTAATTCGAGCTGAATTAGGTAATCCAGGATCTCTAATTGGAGATGACCAAATTTATAATACCATTGTAACAGCTCATGCTTTTATTATAATTTTTTTCATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGATTAGTA
## 1 AGATATTGGAACTTTATATTTTATTTTTGGTATTTGATCTGGTATAGTAGGAACTTCATTAAGATTATTAATTCGGGCAGAATTAGGAAATCCTGGATCATTAATTGGTGATGACCAAATTTATAATACTATTGTTACAGCCCATGCTTTTATTATAATTTTTTTTATGGTAATACCCATTATAATTGGAGGATTTGGAAATTGATTAGTA
## 2 AGATATTGGAACTTTATATTTTATTTTTGGGGCATGATCAGGTATAATTGGTACATCTATAAGATTAATAATTCGTACTGAATTAGGAAATTATGGATCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACAGCCCATGCTTTTATTATAATTTTCTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAATTGATTAGTA
## 3 AGATATTGGAACTTTATATTTTATTTTTGGAGCTTGGGCAGGAATGGTTGGAACTTCGTTGAGAATTTTAATTCGGGCAGAACTTGGACATCCAGGGGCATTAATTGGTGATGATCAAATTTATAACGTAATTGTAACAGCTCATGCTTTTATCATAATTTTCTTTATAGTAATACCTATTATAATTGGAGGATTTGGAAATTGATTAGTA
## 4 AGATATTGGAACTTTATATTTTATTTTTGGGGCTTGGGCAGGAATAGTTGGAACTTCATTAAGAATTTTAATTCGAGCAGAACTTGGTCATCCGGGGGCACTAATTGGTGATGATCAAATTTATAATGTTATTGTAACAGCTCATGCATTTATTATAATTTTTTTTATAGTAATACCCATTATAATTGGAGGATTTGGAAATTGATTAGTA
## 5 AGATATTGGAACTTTATATTTTATTTTTGGTGCATGATCAGGTATAGTAGGAACATCTTTAAGTATACTAATTCGAGCTGAATTAAATCAACCAGGATCTTTAATTGGAGATGATCAAATCTATAATGTAATTGTAACAGCCCATGCATTTATTATAATTTTTTTCATAGTTATACCAATTTTAATTGGAGGATTTGGAAATTGATTAGTA
## 6 AGATATTGGAACTTTATATTTTATTTTTGGAGCATGATCAGGTGTAATTGGTACATCTATAAGATTAATAATTCGTACTGAATTAGGAAATTCTGGGCCTTTAATTGGTGATGACCAAAATTTATAATGTTATTGTTACAGCCCATGCTTTTATTATATTTTTCTAGAGTTATATACCTATTATAATTGGAGGATTTGGAAATTGATTAGTA
mydata2 <- as.list(setNames(mydata$seqs, mydata$seqID)) #make sure headers are not capitalized

Next we use the bold_identify function to query the BOLD API with our sequences. This will return all the matches in BOLD, which should also include sequences mined from GenBank (https://www.ncbi.nlm.nih.gov/genbank/).

install.packages('bold')
library('bold')
output <- bold_identify(sequences = mydata2, db = "COX1", response=FALSE) #This can take several hours to run

outtax40 <- lapply(output, head, n=40)
outtaxframe <- do.call("rbind", lapply(outtax40, data.frame))

outtaxframe now holds the top 40 matches, by similarity, from the BOLD results for each query sequence.

#We used a custom header style for the final xlsx file
install.packages('openxlsx')
library('openxlsx') #Must have Rtools installed and check box to edit PATH or afterwards do: Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") ##path to zip.exe
HS <- createStyle(fontSize=13, fontColour='navy', numFmt='GENERAL', halign='center', valign='center', textDecoration='bold', wrapText=TRUE)

install.packages('dplyr')
library('dplyr')
library('tibble')
library('magrittr') #provides the %<>% assignment pipe used below
outtaxframe %<>% #This only keeps rows that we want and updates the dataframe
rownames_to_column("seqID") %>%
filter(specimen_country %in% c("United States", "Canada"), similarity >= 0.98)
write.xlsx(outtaxframe, file='2014outtaxframe40.xlsx', asTable=FALSE, colNames=TRUE, rowNames=TRUE, headerStyle=HS)
#Might get an error if Rtools is not installed - follow error message suggestions to get Rtools

At this point we are working with three files: 2014outtaxframe40.xlsx, 2014tabseqs.txt, and 2014final_no_singletons.txt. We then open the 2014biom_no_singletons.txt file and convert any within-sample occurrences < 10 to 0, with simple find and replace commands in MS Excel, to remove low abundance OTU occurrences. This step is to remove potential sequencing errors; the threshold of 10 is arbitrary and can be changed depending on project objectives. Next we open 2014outtaxframe40.xlsx, look for any discrepancies among matches at the same similarity in our output from bold_identify, and then assign taxonomy in new columns in the 2014final_no_singletons.txt file. Some returns from bold_identify need to be manually investigated. Your project may call for other ways of filtering 2014outtaxframe40.xlsx based on similarity and geography. Nonetheless, we chose a simple example to demonstrate the process of how we learned to convert results into assigned taxonomy. As an example, we search the 2014tabseqs.txt file to find the sequence for denovo118.